Student : Esteban Ordenes

Post Graduate Program in Data Science and Business Analytics

PGP-DSBA-UTA-Dec20-A

Credit Card Users Churn Prediction

Context

Thera Bank recently saw a steep decline in the number of users of its credit card. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving the credit card service would mean a loss for the bank, so the bank wants to analyze its customer data, identify the customers who are likely to leave, and understand their reasons for leaving, so that it can improve in those areas.

Objective

Thera Bank needs a classification model that will help it improve its services so that customers do not renounce their credit cards.

Data Dictionary

Loading libraries

Read the dataset

View the first and last 5 rows of the dataset.

Understand the shape of the dataset.

Check data types and number of non-null values for each column.

Summary of the dataset
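The inspection steps above can be sketched with pandas as follows. This is only an illustration: the tiny DataFrame and its values are hypothetical stand-ins for the bank's data, and only some of the column names come from this notebook.

```python
import pandas as pd

# Hypothetical miniature stand-in for the dataset; all values are made up.
df = pd.DataFrame(
    {
        "Customer_Age": [45, 49, 51, 40, 44, 37],
        "Credit_Limit": [12691.0, 8256.0, 3418.0, 3313.0, 4010.0, 11656.0],
        "Gender": ["M", "F", "M", "F", "M", "M"],
        "Attrition_Flag": ["Existing", "Existing", "Attrited",
                           "Existing", "Attrited", "Existing"],
    }
)

first_rows = df.head(5)      # first 5 rows
last_rows = df.tail(5)       # last 5 rows
n_rows, n_cols = df.shape    # (number of rows, number of columns)
df.info()                    # dtype and non-null count for each column
summary = df.describe().T    # summary statistics of the numeric columns
print(summary)
```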

Converting the data type of categorical features to 'category'
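A minimal sketch of this conversion, assuming the categorical columns are listed by hand (the small DataFrame is hypothetical):

```python
import pandas as pd

# Hypothetical sample; the real notebook would use the loaded dataset.
df = pd.DataFrame(
    {
        "Gender": ["M", "F", "F"],
        "Card_Category": ["Blue", "Silver", "Blue"],
        "Customer_Age": [45, 49, 51],
    }
)

cat_cols = ["Gender", "Card_Category"]  # columns treated as categorical
for col in cat_cols:
    df[col] = df[col].astype("category")

print(df.dtypes)
```

Storing low-cardinality text columns as `category` usually reduces memory use and makes the intended role of each column explicit.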

Let's evaluate the dependent variable - Attrition_Flag

EDA

Univariate analysis

Observation on Customer_Age

Observation on Months_on_book

Observation on Total_Revolving_Bal

Observation on Total_Trans_Amt

Observation on Total_Trans_Ct

Observation on Credit_Limit

Observation on Avg_Open_To_Buy

Observation on Total_Amt_Chng_Q4_Q1

Observation on Total_Ct_Chng_Q4_Q1

Observation on Avg_Utilization_Ratio

Observations on Attrition_Flag (Dependent Variable)

Observations on Gender

Observations on Education_Level

Observations on Marital_Status

Observations on Income_Category

Observations on Card_Category

Observations on Dependent_count

Observations on Total_Relationship_Count

Observations on Months_Inactive_12_mon

Observations on Contacts_Count_12_mon

Bivariate Analysis

Bivariate analysis of each attribute in relation to Attrition_Flag
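One possible way to relate an attribute to Attrition_Flag without plotting is a row-normalized cross-tabulation for categorical attributes and a per-class mean for numeric ones. The data below is hypothetical and the class labels `Existing` and `Attrited` are assumed for illustration.

```python
import pandas as pd

# Hypothetical sample with one categorical and one numeric attribute.
df = pd.DataFrame(
    {
        "Gender": ["M", "F", "F", "M", "F", "M", "F", "M"],
        "Total_Trans_Ct": [60, 30, 70, 80, 40, 65, 75, 35],
        "Attrition_Flag": ["Existing", "Attrited", "Existing", "Existing",
                           "Attrited", "Existing", "Existing", "Attrited"],
    }
)

# Share of each Attrition_Flag class within every Gender value (rows sum to 1).
rates = pd.crosstab(df["Gender"], df["Attrition_Flag"], normalize="index")

# Mean of a numeric attribute for each Attrition_Flag class.
mean_ct = df.groupby("Attrition_Flag")["Total_Trans_Ct"].mean()

print(rates)
print(mean_ct)
```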

Customer_Age vs Attrition_Flag

Months_on_book vs Attrition_Flag

Total_Revolving_Bal vs Attrition_Flag

Total_Trans_Amt vs Attrition_Flag

Total_Trans_Ct vs Attrition_Flag

Credit_Limit vs Attrition_Flag

Avg_Open_To_Buy vs Attrition_Flag

Total_Amt_Chng_Q4_Q1 vs Attrition_Flag

Total_Ct_Chng_Q4_Q1 vs Attrition_Flag

Avg_Utilization_Ratio vs Attrition_Flag

Gender vs Attrition_Flag

Education_Level vs Attrition_Flag

Marital_Status vs Attrition_Flag

Income_Category vs Attrition_Flag

Card_Category vs Attrition_Flag

Dependent_count vs Attrition_Flag

Total_Relationship_Count vs Attrition_Flag

Months_Inactive_12_mon vs Attrition_Flag

Contacts_Count_12_mon vs Attrition_Flag

Multivariate analysis

Analyze the variables that show a moderate or high correlation with Attrition_Flag.

Data Pre-processing

Outlier treatment
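The notebook does not state which outlier treatment was applied. As an assumption, the sketch below clips values to the usual 1.5 × IQR whiskers; the helper `clip_iqr` and the sample values are hypothetical.

```python
import pandas as pd

def clip_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical values with one extreme point.
credit_limit = pd.Series(
    [1500.0, 2000.0, 2500.0, 3000.0, 3500.0, 34000.0], name="Credit_Limit"
)
treated = clip_iqr(credit_limit)
print(treated.tolist())
```

Clipping keeps every row, which matters when the extreme values belong to real customers rather than data errors; dropping rows or leaving outliers untouched for tree-based models are equally defensible choices.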

Missing-Value Treatment

KNN Imputer

Split Data
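A minimal sketch of a stratified split, assuming a 70/30 ratio and a binary target where 1 marks an attrited customer (the data, the ratio and the encoding are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 customers, 20% of them attrited (label 1).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.array([1] * 40 + [0] * 160)

# stratify=y keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)
print(X_train.shape, X_test.shape, y_train.mean(), y_test.mean())
```

Splitting before imputation, encoding and resampling keeps information from the test rows out of every fitted preprocessing step.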

Imputing Missing Values
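The sketch below shows the idea with scikit-learn's `KNNImputer`: the imputer is fitted on the training rows only and then reused on the test rows. The arrays and `n_neighbors=2` are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric data with missing values in the second column.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, np.nan], [4.0, 40.0]])
X_test = np.array([[2.2, np.nan]])

imputer = KNNImputer(n_neighbors=2)
X_train_imp = imputer.fit_transform(X_train)  # learn neighbours from train only
X_test_imp = imputer.transform(X_test)        # reuse the train neighbours

print(X_train_imp)
print(X_test_imp)
```

`KNNImputer` works on numeric arrays, so categorical columns would need to be mapped to numbers before imputation and mapped back afterwards.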

Encoding categorical variables
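One common encoding is one-hot encoding with `pd.get_dummies`; whether the original notebook used it is an assumption. The sample frame is hypothetical.

```python
import pandas as pd

# Hypothetical feature frame with two categorical columns.
X = pd.DataFrame(
    {
        "Gender": ["M", "F", "M"],
        "Card_Category": ["Blue", "Silver", "Gold"],
        "Customer_Age": [45, 49, 51],
    }
)

# drop_first=True removes one level per column to avoid redundant dummies.
X_encoded = pd.get_dummies(X, columns=["Gender", "Card_Category"], drop_first=True)
print(X_encoded.columns.tolist())
```

When train and test sets are encoded separately, their dummy columns may differ; reindexing the test frame to the training columns is one way to keep them aligned.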

Building the model

Model evaluation criterion:

Model can make wrong predictions as:

Most importantly,

To reduce this loss, we need to reduce False Negatives, so recall is the metric to maximize.
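Reducing false negatives is equivalent to raising recall, defined as TP / (TP + FN). The example below assumes label 1 means an attrited customer, so a false negative is a churner the model predicted would stay; the labels themselves are made up.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = attrited customer, 0 = existing customer.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)

print(f"TP={tp} FN={fn} FP={fp} TN={tn} recall={recall:.2f}")
```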

Logistic Regression

Evaluate the model performance by using KFold and cross_val_score
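A sketch of cross-validated recall for logistic regression. It uses `StratifiedKFold`, a K-fold variant that preserves the class ratio in each fold, which is an assumption rather than something the notebook states; the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for the training set.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.84, 0.16], random_state=1
)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
model = LogisticRegression(max_iter=1000)

# One recall score per fold; the spread indicates how stable the model is.
scores = cross_val_score(model, X, y, scoring="recall", cv=kfold)
print(scores, scores.mean())
```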

Oversampling train data using SMOTE
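SMOTE creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The function below is a hand-rolled sketch of that idea, not the imbalanced-learn implementation (in practice `imblearn.over_sampling.SMOTE` is the usual choice); `smote_sketch` and the data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=1):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is returned as its own nearest one.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)          # random minority rows
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]   # one of their k neighbours
    gap = rng.random((n_new, 1))                            # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# Synthetic training data: 90 majority rows and 10 minority rows.
rng = np.random.default_rng(0)
X_majority = rng.normal(0, 1, size=(90, 2))
X_minority = rng.normal(3, 1, size=(10, 2))

X_new = smote_sketch(X_minority, n_new=len(X_majority) - len(X_minority))
X_minority_balanced = np.vstack([X_minority, X_new])
print(X_majority.shape, X_minority_balanced.shape)
```

Only the training data is resampled; the validation and test sets keep their original class ratio so that the reported metrics reflect real conditions.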

Logistic Regression on oversampled data

Regularization
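In scikit-learn's `LogisticRegression`, the strength of regularization is controlled by `C`, the inverse of the penalty weight: a smaller `C` shrinks the coefficients more. The sketch below illustrates this on synthetic data; the `C` values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data, scaled so that the penalty treats all features alike.
X, y = make_classification(n_samples=400, n_features=6, random_state=1)
X = StandardScaler().fit_transform(X)

norms = {}
for C in (0.01, 1.0, 100.0):
    # Default penalty is L2; smaller C means stronger regularization.
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))

print(norms)
```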

Undersampling train data
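SMOTE generates new samples, so it cannot by itself shrink the majority class. A common way to undersample is to keep a random subset of the majority class, as in the sketch below. The helper `random_undersample` is hypothetical; imbalanced-learn's `RandomUnderSampler` offers the same idea as a ready-made class.

```python
import numpy as np

def random_undersample(X, y, seed=1):
    """Keep as many rows of every class as the smallest class has."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate(
        [rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
         for c in classes]
    )
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Synthetic training data: 16 attrited (1) and 84 existing (0) customers.
X_train = np.arange(200, dtype=float).reshape(100, 2)
y_train = np.array([1] * 16 + [0] * 84)

X_under, y_under = random_undersample(X_train, y_train)
print(X_under.shape, np.bincount(y_under))
```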

Logistic Regression on undersampled data

Let's evaluate the model performance by using KFold and cross_val_score

Performance on the test set

Building different models using KFold and cross_val_score with pipelines, and tuning the best models using GridSearchCV and RandomizedSearchCV
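A sketch of comparing several models through pipelines and cross-validated recall. The model list is an assumption limited to scikit-learn estimators; an `XGBClassifier` from the xgboost package could be added to the dictionary in the same way if it is installed. The data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.84, 0.16], random_state=1
)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "gradient_boost": GradientBoostingClassifier(random_state=1),
    "ada_boost": AdaBoostClassifier(random_state=1),
}

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
cv_recall = {}
for name, estimator in models.items():
    # The scaler is refitted inside every fold, so no fold leaks into another.
    pipe = Pipeline([("scaler", StandardScaler()), ("model", estimator)])
    cv_recall[name] = cross_val_score(pipe, X, y, scoring="recall", cv=kfold).mean()

best_name = max(cv_recall, key=cv_recall.get)
print(cv_recall, best_name)
```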

Hyperparameter Tuning
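The two search strategies used in the following subsections can be sketched on a single pipeline. `GridSearchCV` tries every combination in the grid, while `RandomizedSearchCV` evaluates only `n_iter` sampled combinations. The grid values, the AdaBoost estimator and the synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.84, 0.16], random_state=1
)

pipe = Pipeline(
    [("scaler", StandardScaler()), ("model", AdaBoostClassifier(random_state=1))]
)

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "model__n_estimators": [25, 50],
    "model__learning_rate": [0.1, 1.0],
}

grid = GridSearchCV(pipe, param_grid, scoring="recall", cv=3).fit(X, y)
rand = RandomizedSearchCV(
    pipe, param_grid, n_iter=3, scoring="recall", cv=3, random_state=1
).fit(X, y)

print(grid.best_params_, grid.best_score_)
print(rand.best_params_, rand.best_score_)
```

Randomized search trades exhaustiveness for speed, which tends to matter once the grid grows beyond a few dozen combinations.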

XGBoost

GridSearchCV

RandomizedSearchCV

GradientBoost

GridSearchCV

RandomizedSearchCV

AdaBoost

GridSearchCV

RandomizedSearchCV

Comparing all models

Feature importance from the tuned xgboost model
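Tree-based boosting models expose a `feature_importances_` attribute after fitting. The sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for the tuned XGBoost model, on synthetic data with generic feature names, so the ranking it prints says nothing about the bank's data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data with placeholder feature names.
feature_names = ["feature_0", "feature_1", "feature_2", "feature_3"]
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=1
)

model = GradientBoostingClassifier(random_state=1).fit(X, y)

# Importances are normalized to sum to 1; sort to rank the features.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```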

Business Recommendation
